    PADIC: extension and new experiments

    PADIC is a multidialectal parallel Arabic corpus. It initially comprised five Arabic dialects, three from the Maghreb and two from the Middle East, in addition to Standard Arabic. In this paper, we present an augmented version of PADIC that adds a Moroccan dialect. We also evaluate, using the σ-index, the computerization level of the Arabic dialects present in PADIC, which reveals that these languages are severely under-resourced. We also discuss several machine translation experiments, in both directions, between all combinations of language pairs. For each language, we interpolated the corresponding language model (LM) with an LM trained on a large Arabic corpus. The results show that this interpolation has no effect on translation performance in some cases and degrades it in others.
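
    The abstract does not say which interpolation scheme was used. As a minimal sketch, assuming plain linear interpolation of two unigram probability tables (the function name, the weight `lam`, and the toy probabilities are all hypothetical and not taken from the paper), the idea can be illustrated as:

```python
def interpolate_lm(p_dialect, p_large, lam):
    """Linearly interpolate two unigram probability tables.

    Assumption: linear interpolation, P(w) = lam * P_dialect(w) + (1 - lam) * P_large(w).
    The paper may use a different scheme or higher-order n-grams.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    vocab = set(p_dialect) | set(p_large)
    return {
        w: lam * p_dialect.get(w, 0.0) + (1.0 - lam) * p_large.get(w, 0.0)
        for w in vocab
    }


# Toy, made-up distributions: a small dialect LM and a large Arabic-corpus LM.
p_dialect = {"a": 0.5, "b": 0.5}
p_large = {"a": 0.25, "c": 0.75}
mixed = interpolate_lm(p_dialect, p_large, lam=0.5)
```

    With `lam` close to 1 the mixture stays near the dialect model; with `lam` close to 0 it is dominated by the large-corpus model, which is one plausible reading of why interpolation could fail to help, or even hurt, on dialectal test data.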

    Cross-Lingual Semantic Similarity Measure for Comparable Articles

    We aim to find and compare cross-lingual articles on a specific topic, which requires a similarity measure. Such a measure can be based on bilingual dictionaries or on numerical methods such as Latent Semantic Indexing (LSI). In this paper, we use LSI in two ways to retrieve Arabic-English comparable articles. The first is monolingual: the English article is translated into Arabic and then mapped into the Arabic LSI space. The second is cross-lingual: Arabic and English documents are mapped into a joint Arabic-English LSI space. We then compare the LSI approaches with the dictionary-based approach on several English-Arabic parallel and comparable corpora. Results indicate that the cross-lingual LSI approach is competitive with the monolingual approach, and even better on some corpora. Moreover, both LSI approaches outperform the dictionary-based approach.
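
    The cross-lingual variant can be sketched with a toy example. This is only an illustration under stated assumptions, not the authors' implementation: the joint space is built by stacking Arabic and English term rows for aligned document pairs, new documents are folded in with the standard LSI projection, and similarity is the cosine. The matrix values and the helper names (`fit_lsi`, `fold_in`, `cosine`) are hypothetical.

```python
import numpy as np


def fit_lsi(term_doc, k):
    """Truncated SVD of a (terms x documents) matrix; returns U_k and the top-k singular values."""
    U, s, _ = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k], s[:k]


def fold_in(term_vec, U_k, s_k):
    """Project a term-count vector into the k-dimensional LSI space (Sigma^-1 U^T d)."""
    return (term_vec @ U_k) / s_k


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Rows 0-1: Arabic terms, rows 2-3: English terms.
# Columns: three aligned Arabic-English document pairs (made-up counts).
joint = np.array([
    [2.0, 0.0, 1.0],   # Arabic term, topic 1
    [0.0, 2.0, 1.0],   # Arabic term, topic 2
    [2.0, 0.0, 1.0],   # English term, topic 1
    [0.0, 2.0, 1.0],   # English term, topic 2
])
U_k, s_k = fit_lsi(joint, k=2)

# Monolingual documents are zero-padded on the other language's rows.
ar_topic1 = fold_in(np.array([1.0, 0.0, 0.0, 0.0]), U_k, s_k)
en_topic1 = fold_in(np.array([0.0, 0.0, 1.0, 0.0]), U_k, s_k)
en_topic2 = fold_in(np.array([0.0, 0.0, 0.0, 1.0]), U_k, s_k)

sim_same = cosine(ar_topic1, en_topic1)
sim_diff = cosine(ar_topic1, en_topic2)
```

    In this toy setup the Arabic document ends up closer to the English document on the same topic than to the one on the other topic, even though the two share no terms. The monolingual variant described in the abstract would instead translate the English article first and use an Arabic-only term-document matrix.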

    Constitution d'un corpus de la langue Arabe à partir du Web

    The web is an inexhaustible source of textual data. In recent years, the community working on the various aspects of language has turned to the web to take advantage of this impressive mass of information. This article describes a corpus-building tool for Arabic. It automatically collects a list of sites dedicated to the Arabic language. The content of these sites is then extracted and normalized. The resulting corpus can be used in various natural language processing applications, particularly for computing statistical language models.
